Abstract

Design, execution and analysis of clinical studies involve several stakeholders with different professional backgrounds.
Typically, principal investigators are familiar with standard office tools, data managers use electronic data capture (EDC)
systems and statisticians work with statistics software. Case report forms (CRFs) specify the data model of study subjects,
evolve over time and consist of hundreds to thousands of data items per study. To avoid error-prone manual transformation
work, a tool that converts between different representations of study data models was designed. It can convert between office
format, EDC format and statistics format. In addition, it supports semantic annotations, which enable precise definitions for data
items. A reference implementation is available as the open source package ODMconverter at http://cran.r-project.org.
Introduction

Several stakeholders are involved in clinical trials, in particular
principal investigators (PIs), data managers and statisticians. Based
on their medical background, principal investigators describe 
informally what data need to be collected to fulfill the study 
objective. This informal description of a data model is discussed 
and refined together with data managers and statisticians. The 
result of this iterative and interactive process between principal 
investigators, data managers and statisticians is a set of case report 
forms (CRFs) for each study. From an informatics point of view, 
these CRFs specify the data model of study subjects. All data items 
in those CRFs need to be well defined, including permissible 
values for each item. 

Data models in clinical studies are increasingly complex. Since the
introduction of the European Clinical Trials Directive (2001/20/EC),
the average length of CRFs has increased from 55 pages (1999-2002)
to 180 pages (2003-2006) per trial [1], associated with major
additional costs [2]. Under the assumption that a typical CRF
page contains 20-50 items, 180 pages correspond to 3600 to 9000
data items per trial. The number of data items drives the amount
of data management work, which is one of the major cost factors
in clinical trials [3]. CRFs define what data will be collected for
the study and therefore determine what data items are available
for statistical analysis at the end of the study. For design,
execution and analysis of a study, different representations of the
data model are needed. In the design phase, PIs typically use
standard office tools (like word processing or spreadsheet
programs) to describe what kind of data need to be collected for a
study. For study execution, data managers work with electronic
data capture (EDC) systems, and statisticians apply dedicated
statistical software for data analysis. Therefore, in a typical study
setting at least three different representations of the data model
are created and need to be updated continuously: the study data
model in office format, EDC format and statistics format.

The objective of this work is to develop and assess automated
methods to transform data models between the representations
used by principal investigators, data managers and statisticians.
These transformations should preserve the semantics of the data
model; therefore, semantic annotations should be included in the
transformation process.

Methods 

Data Models for Data Management in Clinical Studies 

Electronic Data Capture (EDC) systems are applied to provide
electronic case report forms (eCRFs). These systems are customized
by data managers for each clinical study. EDC systems for
clinical trials need to be validated according to regulations from
the U.S. Food and Drug Administration (FDA) [4] and the European
Medicines Agency (EMA) [5]. In cooperation with FDA and EMA,
the Clinical Data Interchange Standards Consortium (CDISC)
defined the Operational Data Model (ODM) [6], an international,
open standard for metadata and data in clinical studies. CDISC
ODM is supported by many commercial EDC systems, for
example Medidata Rave(R) [7], XClinical Marvin [8] and
secuTrial(R) [9]. In addition, CDISC ODM can be semantically
annotated [10]. For these reasons, an automated transformation
method for data models in clinical studies should be able to process
data models in ODM format. Consequently, CDISC ODM was
selected as the data format for EDC systems in the reference
implementation.

Statistics Software in Clinical Studies 

At present, SAS [11] and IBM SPSS [12] are commonly used
commercial software packages for data analysis in clinical studies.
To enable widespread use of a reference implementation, an open
source system is preferable. R [13] is an open source
implementation of the S language, which also underlies S-Plus [14],
another well-known statistical software tool. R can export datasets
to and import datasets from IBM SPSS and SAS. Therefore,
R was chosen to represent the study data model for statistical
analysis. The data model is represented by an R data frame.

Semantic Annotations for Data Models 

Data models for study subjects are defined by CRFs. Each CRF
consists of data items (for example "patient gender"), which can be 
organized in item groups (for example "demographics"). Each 
data item is associated with a set of permissible values (for example 
"male", "female"). In principle, at each level semantic annotations 
from healthcare terminologies can be added to define the 
semantics of data items, item groups and permissible values. 

These annotations can help to overcome the ambiguities of
natural language and enable more precise specifications of data
models. Item names can be ambiguous, for example "length" can
refer to the length of an arm or the length of a leg; abbreviations
can have several meanings, for instance "MS" can denote multiple
sclerosis or mitral stenosis; some data items can be determined in
different ways, e.g. blood pressure can be measured in different
positions (sitting, lying, etc.) and with different methods
(non-invasive, invasive). Semantic codes can provide references to
detailed specifications of data items, both regarding medical
concepts (what is the content of this item?) and permissible values
for each item. In addition, semantic codes can be directly processed
by computer programs and used for comparisons and transformations
of data models. Each semantic annotation consists of a terminology
version and an associated code value.

SNOMED CT [15] is a commonly used healthcare terminology
that can be applied for semantic annotations. SNOMED
CT codes are characterized by a terminology version (for
example "SNOMED CT 2010_0731") and a code value (for
instance "248153007" to represent "male"). Logical Observation
Identifiers Names and Codes (LOINC(R)) [16] is another code
system that can be applied for semantic annotations of data
models. In particular, LOINC(R) provides a large variety of codes
for laboratory procedures. The Unified Medical Language
System (UMLS(R)) [17] is a Metathesaurus consisting of terms and
codes from more than 100 different healthcare terminologies.
The UMLS therefore provides a unique richness of semantic codes
(more than 1.4 million concept codes as of July 2013).
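The annotation structure described above (a terminology version plus a code value, attachable to item groups, items and permissible values) can be sketched as a small data structure. This is an illustrative Python sketch, not the data model used inside ODMconverter; the class names, field names and the `same_concept` helper are assumptions made for this example, while the codes are taken from Table 1.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """One semantic annotation: terminology (with version) plus code value."""
    terminology: str  # e.g. "SNOMED CT 2010_0731" or "UMLS CUI"
    code: str         # e.g. "248153007"


@dataclass
class PermissibleValue:
    coded_value: str
    label: str
    annotations: list = field(default_factory=list)


@dataclass
class Item:
    name: str
    data_type: str
    label: str
    annotations: list = field(default_factory=list)
    permissible_values: list = field(default_factory=list)


@dataclass
class ItemGroup:
    name: str
    label: str
    annotations: list = field(default_factory=list)
    items: list = field(default_factory=list)


def same_concept(a, b):
    """True if two annotated elements share at least one identical annotation."""
    return bool(set(a.annotations) & set(b.annotations))


# Two differently labelled values that carry the same UMLS code.
male = PermissibleValue("1", "male", [Annotation("UMLS CUI", "C0024554"),
                                      Annotation("SNOMED CT 2010_0731", "248153007")])
maennlich = PermissibleValue("m", "maennlich", [Annotation("UMLS CUI", "C0024554")])
female = PermissibleValue("2", "female", [Annotation("UMLS CUI", "C0015780")])

print(same_concept(male, maennlich))  # True
print(same_concept(male, female))     # False
```

Because an annotation is a plain (terminology, code) pair, a program can match values across languages and coding schemes without interpreting the labels.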

Data Model Transformation 

To transform data models between the various representations
(EDC, statistics, office), R [13] functions were designed as a public
reference implementation. These programs contain parsers for the
different formats with file-based input and output. For the
office format, a specific Microsoft Excel template was designed to
capture semantic annotations (also available in csv format for
portability). In principle, this reference implementation can be
used with any medical terminology consisting of terms and
associated codes.

Evaluation Approach 

If an automated transformation of study data models in office 
format, EDC and statistics format is feasible, then transformation 
from EDC into statistics format and back again into EDC format 
should result in the same EDC representation. 

Similarly, transformation from EDC format into office format 
and back again into EDC format should result in the same EDC 
representation. In particular, semantic annotation at the various 
levels (itemgroup, item, permissible values) should be preserved. 
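The round-trip criterion above can be expressed as a small check. The following Python sketch only illustrates the idea; it is not the evaluation code used in this work. The functions `edc_to_rows` and `rows_to_edc` are hypothetical stand-ins for a conversion and its inverse, the XML is a reduced ODM-like fragment without namespaces, and the `keep_annotations` switch exists only to show that a lossy conversion is detected.

```python
import xml.etree.ElementTree as ET

EDC_XML = """<MetaDataVersion>
  <ItemDef OID="I.1001" Name="Willingness" DataType="boolean">
    <Alias Context="UMLS CUI" Name="C1516879" />
  </ItemDef>
  <ItemDef OID="I.1002" Name="Age" DataType="integer">
    <Alias Context="SNOMED CT 2010_0731" Name="102518004" />
  </ItemDef>
</MetaDataVersion>"""


def canonical(xml_text):
    """Whitespace-independent summary of items and their annotations."""
    root = ET.fromstring(xml_text)
    return [
        (d.get("OID"), d.get("Name"), d.get("DataType"),
         tuple((a.get("Context"), a.get("Name")) for a in d.findall("Alias")))
        for d in root.findall("ItemDef")
    ]


def edc_to_rows(xml_text, keep_annotations=True):
    """Hypothetical EDC -> statistics/office conversion (flat rows)."""
    return [{"oid": oid, "name": name, "type": data_type,
             "codes": list(aliases) if keep_annotations else []}
            for oid, name, data_type, aliases in canonical(xml_text)]


def rows_to_edc(rows):
    """Hypothetical inverse conversion back into the EDC format."""
    root = ET.Element("MetaDataVersion")
    for row in rows:
        node = ET.SubElement(root, "ItemDef", OID=row["oid"], Name=row["name"],
                             DataType=row["type"])
        for context, code in row["codes"]:
            ET.SubElement(node, "Alias", Context=context, Name=code)
    return ET.tostring(root, encoding="unicode")


def round_trip_ok(xml_text, **options):
    """EDC -> other format -> EDC must yield the same items and annotations."""
    return canonical(rows_to_edc(edc_to_rows(xml_text, **options))) == canonical(xml_text)


print(round_trip_ok(EDC_XML))                          # True
print(round_trip_ok(EDC_XML, keep_annotations=False))  # False
```

Comparing a canonical summary rather than raw file contents keeps the check insensitive to formatting differences such as whitespace or attribute quoting.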

This evaluation procedure was applied to a simple data model
with few data items and then to a random sample of 10 real-world
ODM files from a public portal for medical data models in ODM
format [18]. As a third evaluation step, the transformation from
office into EDC format was tested for approximately 400 forms
from clinical trials.

Results 

Reference Implementation for Study Data Model 
Transformations 

A reference implementation for automatic transformation of
data models with semantic annotations for principal investigators,
data managers and statisticians was developed. It is implemented
in R and available as the open source package ODMconverter at
http://cran.r-project.org. This software transforms a
data model for study subjects back and forth between different
representations: office format for principal investigators, EDC
format for data managers and statistics format for statisticians.
Semantic annotations are preserved by these transformations.

Table 1 presents a simplified example of a data model in office
format. It is a simple spreadsheet that contains some administrative
information about the study, followed by a catalogue of
data items. The description and selection of relevant data items
requires medical expertise; therefore, this representation of the data
model needs to be editable by medical personnel without special
computer skills. In addition to item descriptions, semantic
codes can be provided. These codes can be looked up with various
tools, for instance using the NCI Metathesaurus [19]. Again,
selection of appropriate semantic codes from healthcare terminologies
like SNOMED CT requires medical expertise and cannot be
done by data managers or statisticians alone.

When a study protocol is completed and approved, the study
database needs to be implemented. CDISC ODM is an open
standard for study data models and is endorsed by regulatory
agencies; therefore, it was chosen as EDC format in the reference
implementation. The software package ODMconverter provides a
function office2ODM, which converts the format presented in
Table 1 into ODM format. ODM files can be directly imported
into several available EDC systems to set up the study database.
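To illustrate what such an office-to-ODM conversion involves, here is a minimal Python sketch. It is not the office2ODM implementation (which is written in R) and covers only the item rows of Table 1, not the study header. The function name `office_to_odm_fragment`, the parsing rules, and the choice to give the code list the data type of its item are assumptions of this example. The OID pattern (IG.1, I.1001, CL.1) follows Figure 1. The result is a MetaDataVersion fragment without namespace, not a complete, validated ODM file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Item rows in the layout of Table 1 (Type, Name, en, then one column per terminology).
OFFICE_CSV = """Type,Name,en,UMLS CUI,SNOMED CT 2010_0731,LOINC
itemgroup,Info,General Information,C0332118,106227002,
boolean,Willingness,Willingness to participate in clinicial trials,C1516879,,
integer,Gender,Gender,,139865004,
codelistitem,1,male,C0024554,248153007,
codelistitem,2,female,C0015780,248152002,
"""


def add_aliases(element, row, contexts):
    """Attach one <Alias> per non-empty code column."""
    for context in contexts:
        code = (row.get(context) or "").strip()
        if code:
            ET.SubElement(element, "Alias", Context=context, Name=code)


def office_to_odm_fragment(csv_text):
    """Build a MetaDataVersion element from Table-1-style item rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    contexts = reader.fieldnames[3:]  # columns after "en" hold semantic codes
    mdv = ET.Element("MetaDataVersion", OID="MD.1", Name="Metadataversion")
    item_defs, code_lists = [], []
    group = item = code_list = None
    alias_jobs = []  # aliases are appended last so they follow other child elements
    for row in reader:
        kind = row["Type"]
        if kind == "itemgroup":
            group = ET.SubElement(mdv, "ItemGroupDef", OID=f"IG.{len(mdv) + 1}",
                                  Name=row["Name"], Repeating="No")
            target = group
        elif kind == "codelistitem":
            if code_list is None:  # first permissible value of the current item
                code_list = ET.Element("CodeList", OID=f"CL.{len(code_lists) + 1}",
                                       Name=item.get("Name"), DataType=item.get("DataType"))
                code_lists.append(code_list)
                ET.SubElement(item, "CodeListRef", CodeListOID=code_list.get("OID"))
            target = ET.SubElement(code_list, "CodeListItem", CodedValue=row["Name"])
            decode = ET.SubElement(target, "Decode")
            ET.SubElement(decode, "TranslatedText", {"xml:lang": "en"}).text = row["en"]
        else:  # any other type is treated as a data item of that data type
            oid = f"I.{1001 + len(item_defs)}"
            ET.SubElement(group, "ItemRef", ItemOID=oid, Mandatory="No")
            item = ET.Element("ItemDef", OID=oid, Name=row["Name"], DataType=kind)
            question = ET.SubElement(item, "Question")
            ET.SubElement(question, "TranslatedText", {"xml:lang": "en"}).text = row["en"]
            item_defs.append(item)
            code_list = None
            target = item
        alias_jobs.append((target, row))
    for element, row in alias_jobs:
        add_aliases(element, row, contexts)
    mdv.extend(item_defs + code_lists)
    return mdv


fragment = office_to_odm_fragment(OFFICE_CSV)
print(ET.tostring(fragment, encoding="unicode"))
```

Each non-empty code cell becomes an `Alias` element whose `Context` is the column heading, which is how the annotation for item I.1001 appears in Figure 1.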

When the data collection of a study is completed and all 
activities to achieve high data quality are finished, the database is 
closed and the data set is handed over to a statistician. The data set 
needs to be transferred from the EDC system into a statistical 
software package. At this point, a transformation of the study data
model from EDC format (Figure 1) into statistics format is
required. The software package ODMconverter provides a
function ODM2R for this task. Figure 2 presents the result of
transforming the data model from Figure 1 into an R data frame.
As a specific feature, all semantic annotations from previous steps
are preserved.
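The step from ODM to a statistics representation can be sketched as follows. This Python example is not the ODM2R function; it only mimics the idea shown in Figure 2, namely an empty table with one column per item plus metadata attributes. The attribute names (StudyOID, Varlabels, Varlabels_en, DataTypes, Alias_Items) are borrowed from Figure 2, whereas the function name `odm_to_frame` and the dictionary layout are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}

# Reduced ODM document with two of the items from Table 1.
ODM_XML = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
 <Study OID="S.0000">
  <MetaDataVersion OID="MD.1" Name="Metadataversion">
   <ItemDef OID="I.1001" Name="Willingness" DataType="boolean">
    <Question><TranslatedText xml:lang="en">Willingness to participate in clinicial trials</TranslatedText></Question>
    <Alias Context="UMLS CUI" Name="C1516879" />
   </ItemDef>
   <ItemDef OID="I.1002" Name="Age" DataType="integer">
    <Question><TranslatedText xml:lang="en">Age</TranslatedText></Question>
    <Alias Context="SNOMED CT 2010_0731" Name="102518004" />
   </ItemDef>
  </MetaDataVersion>
 </Study>
</ODM>"""


def odm_to_frame(xml_text):
    """Return (columns, attributes): empty columns plus metadata per variable."""
    root = ET.fromstring(xml_text)
    columns = {}
    attributes = {"StudyOID": root.find("odm:Study", NS).get("OID"),
                  "Varlabels": [], "Varlabels_en": [], "DataTypes": [], "Alias_Items": []}
    for item in root.iterfind(".//odm:ItemDef", NS):
        oid = item.get("OID")
        columns[oid] = []  # no observations: only the data model is converted
        attributes["Varlabels"].append(item.get("Name"))
        attributes["DataTypes"].append(item.get("DataType"))
        label = item.find("odm:Question/odm:TranslatedText", NS)
        attributes["Varlabels_en"].append(label.text if label is not None else "")
        for alias in item.iterfind("odm:Alias", NS):
            attributes["Alias_Items"].append((oid, alias.get("Context"), alias.get("Name")))
    return columns, attributes


columns, attributes = odm_to_frame(ODM_XML)
print(list(columns))
print(attributes["Alias_Items"])
```

Keeping the semantic codes as attributes next to the columns means they travel with the data set and remain available for a later conversion back into ODM.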
Table 1. Simplified example of data model in office format (spreadsheet).

| Row | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 1 | StudyOID | S.0000 | | | | |
| 2 | Sponsor | Testsponsor | | | | |
| 3 | Condition | Testcondition | | | | |
| 4 | StudyName | ODM Test Study | | | | |
| 5 | StudyDescription | Test of ODM tools | | | | |
| 6 | Form | ODM-Test | | | | |
| 7 | FirstName | Test | | | | |
| 8 | LastName | Testname | | | | |
| 9 | Organization | Test organization | | | | |
| 10 | | | | | | |
| 11 | Type | Name | en | UMLS CUI | SNOMED CT 2010_0731 | LOINC |
| 12 | itemgroup | Info | General Information | C0332118 | 106227002 | |
| 13 | boolean | Willingness | Willingness to participate in clinicial trials | C1516879 | | |
| 14 | integer | Age | Age | | 102518004 | |
| 15 | date | DOB | Date of Birth | | 152322001 | |
| 16 | integer | Gender | Gender | | 139865004 | |
| 17 | codelistitem | 1 | male | C0024554 | 248153007 | |
| 18 | codelistitem | 2 | female | C0015780 | 248152002 | |
| 19 | string | DiagnosisTx | Diagnosis text | | 439401001 | |
| 20 | string | DiagnosisCd | Diagnosis code | | | |
| 21 | float | Crea | Creatinine | | | 38483-4 |
| 22 | time | labTime | Time of lab value | | | |

The header (lines 1-9) contains general information about the study. Lines 13-22 provide data items of different data types (column A). Column C presents item labels
(en = English). Columns D, E and F contain semantic codes for each data item.
doi:10.1371/journal.pone.0090492.t001



    <?xml version="1.0" encoding="UTF-8" ?>
    <ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" Description="ODM-Test S.0000 Testcondition"
         ODMVersion="1.3.1" CreationDateTime="2013-05-27T21:38:03+01:00" FileOID="ODM-Test S.0000.xml" FileType="Snapshot">
      <Study OID="S.0000">
        <GlobalVariables> (collapsed)
        <BasicDefinitions> (collapsed)
        <MetaDataVersion OID="MD.1" Name="Metadataversion">
          <Protocol><StudyEventRef StudyEventOID="SE.1" OrderNumber="1" Mandatory="Yes" /></Protocol>
          <StudyEventDef OID="SE.1" Name="Testcondition" Repeating="No" Type="Unscheduled"> (collapsed)
          <FormDef OID="F.1" Name="ODM-Test" Repeating="No"> (collapsed)
          <ItemGroupDef OID="IG.1" Name="Info" Repeating="No"> (collapsed)
          <ItemDef OID="I.1001" Name="Willingness" DataType="boolean">
            <Question>
              <TranslatedText xml:lang="en">Willingness to participate in clinicial trials</TranslatedText>
            </Question>
            <Alias Context="UMLS CUI" Name="C1516879" />
          </ItemDef>
          <ItemDef OID="I.1002" Name="Age" DataType="integer"> (collapsed)
          <ItemDef OID="I.1003" Name="DOB" DataType="date"> (collapsed)
          <ItemDef OID="I.1004" Name="Gender" DataType="integer"> (collapsed)
          <ItemDef OID="I.1005" Name="DiagnosisTx" DataType="string"> (collapsed)
          <ItemDef OID="I.1006" Name="DiagnosisCd" DataType="string"> (collapsed)
          <ItemDef OID="I.1007" Name="Crea" DataType="float"> (collapsed)
          <ItemDef OID="I.1008" Name="labTime" DataType="time"> (collapsed)
          <CodeList OID="CL.1" Name="Gender" DataType="string"> (collapsed)
        </MetaDataVersion>
      </Study>
      <AdminData> (collapsed)
    </ODM>

Figure 1. Example of data model in CDISC ODM format. It consists of one form ("ODM-Test") with one itemgroup ("Info") and 8 data items.
Details for item I.1001 are displayed, including item name, detailed description in English and its associated UMLS code. Elements marked
"(collapsed)" are folded in the original screenshot, so their content is not shown.
doi:10.1371/journal.pone.0090492.g001
    'data.frame':   0 obs. of  8 variables:
     $ I.1001: logi
     $ I.1002: int
     $ I.1003: chr
     $ I.1004: Factor w/ 2 levels "1","2":
      ..- attr(*, "labels")= chr  "male" "female"
      ..- attr(*, "labels2")= chr  "maennlich" "weiblich"
      ..- attr(*, "Alias_Contexts")= chr  "UMLS CUI" "SNOMED CT 2010_0731"
      ..- attr(*, "UMLS CUI")= chr  "C0024554" "C0015780"
      ..- attr(*, "SNOMED CT 2010_0731")= chr  "248153007" "248152002"
     $ I.1005: chr
     $ I.1006: chr
     $ I.1007: num
     $ I.1008: chr
     - attr(*, "StudyOID")= chr "S.0000"
     - attr(*, "Sponsor")= chr "Testsponsor"
     - attr(*, "Condition")= chr "Testcondition"
     - attr(*, "StudyName")= chr "ODM Test Study"
     - attr(*, "StudyDescription")= chr "Test of ODM tools"
     - attr(*, "Form")= chr "ODM-Test"
     - attr(*, "FirstName")= chr "Test"
     - attr(*, "LastName")= chr "Testname"
     - attr(*, "Organization")= chr "Test organization"
     - attr(*, "Varlabels")= chr  "Willingness" "Age" "DOB" "Gender" ...
     - attr(*, "Varlabels_en")= chr  "Willingness to participate in clinicial trials" "Age" "Date of Birth" "Gender" ...
     - attr(*, "Varlabels_de")= chr  "Willingness to participate in clinicial trials" "Alter" "Geburtsdatum" "Geschlecht"
     - attr(*, "DataTypes")= chr  "boolean" "integer" "date" "integer" ...
     - attr(*, "itemgroups")= chr  "IG.1" "IG.1" "IG.1" "IG.1" ...
     - attr(*, "itemgroups_unique")= chr "IG.1"
     - attr(*, "itemgroupnames_unique")= chr "Info"
     - attr(*, "itemgroupnames_unique_en")= chr "General Information"
     - attr(*, "itemgroupnames_unique_de")= chr "Allgemeine Angaben"
     - attr(*, "IGAlias")= chr  "IG.1" "UMLS CUI" "C0332118" "IG.1" ...
     - attr(*, "Alias_Items")= chr [1:6, 1:3] "I.1001" "I.1002" "I.1003" "I.1004" ...

Figure 2. Example of data model in statistics format. An R data frame is provided with 8 variables (I.1001 ... I.1008). Labels for variables and
permissible values are defined, for instance "male" and "female" for item I.1004 (Gender). General information about the study like "StudyName" is
provided as attribute of this data frame.
doi:10.1371/journal.pone.0090492.g002



These data model transformations can be inverted, again with
preservation of semantic annotations. For this purpose, package
ODMconverter provides functions R2ODM and ODM2office.
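The inverse direction, from ODM back to the tabular office layout, can be sketched in the same spirit. The Python function below is hypothetical and is not the ODM2office implementation; it works on a reduced ODM-like fragment without namespaces and writes only item rows, with one code column per terminology found in the `Alias` elements. The column layout follows Table 1.

```python
import csv
import io
import xml.etree.ElementTree as ET

ODM_FRAGMENT = """<MetaDataVersion>
  <ItemDef OID="I.1001" Name="Willingness" DataType="boolean">
    <Question><TranslatedText xml:lang="en">Willingness to participate in clinicial trials</TranslatedText></Question>
    <Alias Context="UMLS CUI" Name="C1516879" />
  </ItemDef>
  <ItemDef OID="I.1002" Name="Age" DataType="integer">
    <Question><TranslatedText xml:lang="en">Age</TranslatedText></Question>
    <Alias Context="SNOMED CT 2010_0731" Name="102518004" />
  </ItemDef>
</MetaDataVersion>"""


def odm_to_office_csv(xml_text):
    """Write item definitions as csv rows: Type, Name, en, then code columns."""
    root = ET.fromstring(xml_text)
    # One code column per terminology, in order of first appearance.
    contexts = []
    for alias in root.iter("Alias"):
        if alias.get("Context") not in contexts:
            contexts.append(alias.get("Context"))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Type", "Name", "en"] + contexts)
    for item in root.findall("ItemDef"):
        codes = {a.get("Context"): a.get("Name") for a in item.findall("Alias")}
        label = item.findtext("Question/TranslatedText", default="")
        writer.writerow([item.get("DataType"), item.get("Name"), label]
                        + [codes.get(context, "") for context in contexts])
    return out.getvalue()


print(odm_to_office_csv(ODM_FRAGMENT))
```

An item without a code in a given terminology simply gets an empty cell, which matches the sparse code columns of Table 1.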

Evaluation 

As a first evaluation step, a simple data model with 8 data items 
was converted from office format into ODM, then into an R data 
frame, then back into ODM and finally into office format. All 
intermediate files were checked and verified manually. 



As a second evaluation step, this data model conversion process 
was applied to a random sample of 10 "real-world" ODM-files 
from a public ODM-Portal [20], see Table 2. 

Table 2. Trial IDs, medical condition, number of items and number of annotation codes regarding 10 forms in ODM format, which
were used for the evaluation (randomly selected from www.medical-data-models.org).

| Trial ID | Medical Condition | Number of items | Number of annotation codes |
|---|---|---|---|
| NCT00824083 | Ewing-Sarcoma | 5 | 44 |
| NCT00980135 | Atopic Dermatitis | 12 | 135 |
| NCT01104584 | Breast Cancer | 17 | 219 |
| NCT01147939 | Acute Myeloid Leukemia | 27 | 306 |
| NCT01179620 | Renal Dialysis | 8 | 53 |
| NCT01283724 | Endometriosis | 11 | 172 |
| NCT01324947 | Multiple Myeloma | 33 | 333 |
| NCT01361334 | Acute Myeloid Leukemia | 28 | 376 |
| NCT01403376 | Multiple Sclerosis | 16 | 163 |
| NCT01408095 | Diabetes Mellitus, Type 2 | 27 | 355 |

doi:10.1371/journal.pone.0090492.t002

PLOS ONE | www.plosone.org | February 2014 | Volume 9 | Issue 2 | e90492

These 10 files in ODM format were converted into csv (comma
separated value) format (function ODM2office) and then back
into ODM format (function office2ODM). All 10 files were also
transformed into an R data frame (function ODM2R) and then
back into ODM format (function R2ODM). All intermediate files
were checked and verified manually using a standard text editor
(Notetab++ version 5.6.8) and Microsoft Excel (version 2010). A
recently developed method to automatically compare medical
forms [21] was applied to verify that all generated ODM files
contained identical items and semantic annotations.
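A check for identical items and annotations can be sketched as a set comparison. The Python code below is not the comparison method of [21], whose details are not given here; it is a simplified illustration. The function names `signature` and `compare_models` are assumptions of this example, and the item and code values come from Table 1.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Tag name without an XML namespace prefix such as {http://...}."""
    return element.tag.rsplit("}", 1)[-1]


def signature(xml_text):
    """Set of (item OID, data type, terminology, code) tuples for one model."""
    entries = set()
    for element in ET.fromstring(xml_text).iter():
        if local_name(element) != "ItemDef":
            continue
        oid, data_type = element.get("OID", ""), element.get("DataType", "")
        aliases = [child for child in element if local_name(child) == "Alias"]
        if not aliases:  # keep unannotated items visible in the comparison
            entries.add((oid, data_type, "", ""))
        for alias in aliases:
            entries.add((oid, data_type, alias.get("Context", ""), alias.get("Name", "")))
    return entries


def compare_models(xml_a, xml_b):
    """Report entries that occur in only one of the two models."""
    first, second = signature(xml_a), signature(xml_b)
    return {"only_in_first": sorted(first - second),
            "only_in_second": sorted(second - first)}


ORIGINAL = """<MetaDataVersion>
  <ItemDef OID="I.1004" Name="Gender" DataType="integer">
    <Alias Context="SNOMED CT 2010_0731" Name="139865004" />
  </ItemDef>
</MetaDataVersion>"""

# Same item, but the annotation was lost along the way.
CONVERTED = """<MetaDataVersion>
  <ItemDef OID="I.1004" Name="Gender" DataType="integer" />
</MetaDataVersion>"""

print(compare_models(ORIGINAL, ORIGINAL))
print(compare_models(ORIGINAL, CONVERTED))
```

Two files pass the check when both difference lists are empty; any lost, added or altered annotation shows up as an entry on one side.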

As a third step, the transformation from office into EDC format
was tested for a larger set of forms from clinical trials. Approximately
400 eligibility forms from clinical trials with active participation of
Münster University Hospital were identified on the Internet [22].
These forms were manually annotated with semantic codes, in
particular UMLS and SNOMED CT codes, using Microsoft Excel
templates. Using ODMconverter, these files were converted into
ODM format and then uploaded into an Internet portal [20].

Discussion 

With the proposed reference implementation and its technical
evaluation we demonstrated that automatic transformation of data
models with semantic annotations for principal investigators, data
managers and statisticians is feasible. A literature search
(PubMed, Google) did not identify a similar
approach. This method of integrated data management is
currently being applied in practice to design and implement an
observational study regarding craniocerebral injuries in Münster,
Germany.

In general, stakeholders with different professional backgrounds
need to work together in clinical studies. Data management for
studies consumes a lot of resources [2] and requires contributions
from principal investigators, data managers and statisticians.
A key task is to design and implement a set of CRFs for each study.
Currently, different tools are being applied for this task, in
particular office tools like Microsoft Excel, EDC tools and statistics
software. In clinical trials, CRFs are quite complex with 180 CRF
pages on average [1]. Given this complexity of data models, the
iterative nature of CRF design, and the need to synchronize
different representations (office/EDC/statistics format), an
automated method can help to reduce manual, error-prone
transformation work. In contrast to generic extract-transform-load
(ETL) tools, no customization of an ETL process is
needed with our method, because it is based upon CDISC ODM.

Another important aspect of the proposed method is semantic
annotation of data models. Design, execution and analysis of
clinical studies involve several stakeholders with different
backgrounds. Despite the availability of international healthcare
terminologies like SNOMED CT and LOINC for many years,